Abstract

Applying equal testing and verification effort to all parts of a software system is not very
efficient, especially when resources are limited and scheduling is tight. Therefore, one 
needs to be able to differentiate low / high fault frequency components so that testing / 
verification effort can be concentrated where needed. Such a strategy is expected to detect 
more faults and thus improve the resulting reliability of the overall system. This paper 
presents the Optimized Set Reduction approach for constructing such models, intended to 
fulfill specific software engineering needs. Our approach to classification is to measure the
software system and build multivariate stochastic models for predicting high-risk system
components. We present experimental results obtained by classifying Ada components into
two classes: likely or not likely to generate faults during system and acceptance test. We also
evaluate the accuracy of the model and the insights it provides into the error-making
process.

Key words: Optimized Set Reduction, data analysis, fault-prone Ada components,
stochastic modeling, machine learning, classification trees, logistic regression.

1 Introduction 

It has been noted that a small number of software components are responsible for a disproportionately 
large number of faults in any large-scale system [BP84, SP88, MK92]. Therefore, if we can identify 
components that are likely to produce a large number of faults, we can concentrate the verification and 
testing processes on them. This allows us to optimize the reliability of our software system with 
minimum cost. To do this, we build quantitative models that predict which components are likely to 
contain the highest concentration of faults. However, building such models is a difficult task: it is often 
the case in software engineering that the data which is collected is minimal, incomplete and 
heterogeneous [BBT92]. This presents several problems for model construction and interpretation (e.g.,
small data sets, inaccurate models, outliers). Therefore, we need a modeling process that is robust to 
these problems, allows for the reliable classification of high risk components (those that have a high 
probability of generating a fault during system or acceptance test), and aids in the understanding of the 
causes of this high risk. This understanding is important because it can give us insight into the software 
development process, allowing us to take remedial actions and make better process decisions in the 
future. 

In this context, we will examine the use of the following modeling approaches: 

• Logistic regression, which is one of the most commonly used classification techniques [Agr90, 
HL89]. This technique has been applied to software engineering modeling [MK92], as well as 
other experimental fields, and will therefore be used as a baseline for comparison in this paper. 


Many assumptions and constraints inherent to this technique make it difficult to apply in a
software engineering context: (1) non-monotonicity of the probability density function on the
explanatory variable range, and (2) interactions between explanatory variables, which are difficult to take into
account when performing exploratory data analysis with numerous explanatory variables.

1 Research for this study was supported in part by NASA grant NSG 5123 and NSF grant 01-5-24845

• Classification trees, which are described in [BF+84]. They were used to address software
engineering modeling issues in [PA+82, SP88]. A review may be found in [CE87, BBT92].
Their strengths stem from their simplicity and readability. Their weaknesses come from a lack of 
ability to extract and use all statistically significant trends and a tendency to include non-relevant 
and non-significant information in the tree. 

• Optimized Set Reduction (OSR), which has been developed at the University of Maryland 
[BBT92] in the framework of the TAME project [BR88] and has already been applied to several 
software engineering applications [BBT92, BBH92, BTH93]. It is partially based on both 
machine learning principles [Q86, BF+84], and univariate statistics [Cap88]. Our motivation for 
developing OSR, and a tool to support it, was to design a data analysis approach that matches, to 
the extent possible, the specific needs of multivariate empirical modeling for software 
engineering [BBT92]. OSR generates logical expressions which represent patterns in a data set.
For instance, consider the following example of a simple pattern (logical expression) related to 
high fault concentration: 

Example 1: 

A compilation unit that imports numerous declarations from outside the subsystem in which it is 
developed, that shows a large average statement nesting level and an intense use of global 
variables is likely to generate fault reports during system and acceptance testing. The 
corresponding logical expression characterizing this class of compilation units would be: 
NONLOC_IMP = High ∧ (NESTING = High ∧ GLOBALS = High)

In this paper, we intend to show that OSR may be used as an alternative to logistic regression or 
classification trees to generate empirical models of risk within a software system, and that it can yield 
more accurate results. We will discuss issues related to the interpretation of the generated models. In 
particular, we will demonstrate how OSR can be useful in (1) identifying characteristics of high-risk 
components in a large Ada system and (2) providing some understanding about how faults originate 
during the software development process.

In Section 2, we present an evolved version of the OSR algorithm (an earlier version of the OSR 
approach was applied to project cost estimation and published in [BBT92]) which is intended to make 
OSR models more accurate and easier to interpret. Specifically, the new algorithm improves the 
interpretability and the accuracy of the models in three ways. First, it provides a mechanism for dealing
with the discretization of the explanatory variable ranges in an automated way. This better supports the
requirement that our models need to be able to handle the problem of heteroscedasticity (see R5 in
[BBT92]). Second, we provide OSR with the ability to work with conjunctive predicates (which will
be called predicates in this paper), allowing our models to elicit the effects of combinations of variables 
which were not visible in the previous version of OSR. Finally, we provide support for recognizing 
similarities among patterns, which aids the user in model interpretation. These second and third 
adaptations help OSR deal with the requirement that our models are able to handle interdependencies and 
interactions among the explanatory variables (see R4 in [BBT92]). 

Also in contrast to [BBT92], this paper applies the OSR modeling technique to the issue of classifying 
Ada components as either low or high risk, as opposed to project cost estimation (prediction on a 
continuous range). Accordingly, we use logistic regression and classification trees as a baseline for 
evaluating the OSR results. (Preliminary and partial results of this research were presented in [BBH92] 
based on the analysis of FORTRAN systems). 

In Section 3, we present a validation of the OSR process, which is based on constructing models using
data from a large Ada system developed at the NASA Goddard Space Flight Center. In Section 3.2 we
compare the generated OSR models to both logistic regression and classification tree models with respect 
to their accuracy. In Section 3.3, we discuss the interpretability of the OSR models. Finally, in Section 
4, we outline the main conclusions of this paper and define the future directions of the research. 


2 Optimized Set Reduction 

Assume we want to assess a characteristic of an object. We will refer to this characteristic as the 
dependent variable (Y). The object is represented by a set of explanatory (known or assessable) 
variables (called Xs). These variables can be either discrete or continuous. Also, assume we have a 
historical data set containing a set of experiences that contain the previously cited Xs plus an associated 
actual Y value. Our goal will be to determine which subset of experiences from the historical data set 
provides the best characterizations of the current object to be assessed. 

Example 2. Assess the expected frequency of faults (Y) that will be detected during system and 
acceptance test within a particular compilation unit. For instance, the Xs may be: complexity 
metrics, system architecture metrics or developer related evaluation of skills. 

2.1 The OSR Process 

First, we will introduce new terminology in an attempt to both formalize the intuitive concepts related to 
empirical modeling and give those concepts a firm grounding in the OSR context. Subsection 2.1.1 
presents the notions informally to provide the reader with some intuition about the method. The rest of 
Section 2 will be more structured and formal in order to define more complex notions without
ambiguity. Where appropriate, definitions are given as formal specifications; others are given in
algorithmic form.

2.1.1 Basic Definitions 

Assume we have a historical data set consisting of n experiences, where each experience consists of a
value for a single dependent variable (Y) and a set of values corresponding to a set of m explanatory
variables (EV = {X_1, X_2, ..., X_m}). We define the term pattern vector to mean one such
experience. Assume the dependent variable's value domain (dom(Y)) is divided into a set of disjoint
and exhaustive classes which can be either intervals (if Y is continuous) or categories (if Y is
discrete). Each explanatory variable has its own value domain (dom(X_i)) which, like dom(Y), is divided
into a set C_i of disjoint and exhaustive value classes, C_i = {Class_i1, Class_i2, ..., Class_ic}. We define a
measurement vector to be a pattern vector without the dependent variable Y. (Note that a measurement
vector can be used to represent an object whose dependent variable value is not known, but is of interest
and which we wish to assess.) The measurement vector value domain is

\[ MV = \prod_{i \in \{1, \ldots, m\}} dom(X_i) \]

Likewise, the pattern vector value domain (i.e., the domain of the vectors in the data set) can be
represented as PV = dom(Y) × MV. We define PVS ⊆ PV to be a pattern vector set, representing the
historical data set.


Example 3: Suppose (Size = 100 LOC, Function_type = computation) is a measurement
vector characterizing a compilation unit. Assuming Y is #faults, (#faults = 6, Size = 100 LOC,
Function_type = computation) is a pattern vector characterizing a particular testing experience on
a compilation unit.

At the very heart of the OSR process is what we call a singleton predicate. We define a singleton
predicate to be a pair of the following form: (X_i, Class_ij), meaning that explanatory variable X_i has a
value belonging to Class_ij ⊆ dom(X_i). A singleton predicate (also written X_i ∈ Class_ij) is said to be
TRUE for a measurement vector if that vector's explanatory variable X_i value is an element of Class_ij;
otherwise, the singleton predicate is said to be FALSE for that vector.

Example 4: Size ∈ [50, 200) is a singleton predicate.

Now that we have defined the notion of a singleton predicate, we can define other elements of OSR 
which are built upon this notion. For instance, we can define a conjunctive predicate (denoted Pred and 
simply called a predicate from here on) as the conjunction of singleton predicates. We will consider a 
predicate to be a set of singleton predicates, where the conjunction is implicit. A predicate is said to be 
TRUE for a given measurement vector if each of its constituent singleton predicates is TRUE for that 
vector. (Note that defining a predicate to be a set (conjunction) of singleton predicates gives OSR the
ability to elicit some of the complex interdependencies that exist between the explanatory variables; see
requirement R4 in [BBT92].)

Example 5: Size ∈ [50, 200) ∧ Function_type ∈ {computation} is a predicate.

A predicate may be used to characterize sets of pattern vectors. For example, if we define 
IS_TRUE(Pred, pv) to yield TRUE if Pred is a true logical expression for the pattern vector pv, (i.e., 
each singleton predicate in Pred is true for pv), then we can define a predicate Pred and a subset PSS of 
the historical data set (PVS) such that IS_TRUE(Pred, pv) yields TRUE for each pv in PSS. Similarly, 
we define SUBSET(PSS, Pred) to denote a subset of PSS characterized by Pred. Also, we define PSS 
to be equivalent to SUBSET(PSS, TRUE). Finally, MEMBER(X, Pred) yields the value TRUE if the 
variable X appears anywhere in Pred, FALSE otherwise. 
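
The primitives above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' tool: a vector is modeled as a dictionary, a class as either a half-open interval `(lo, hi)` or a set of categories, and a predicate as a list of `(variable, class)` pairs. The helper names `in_class`, `is_true`, `subset` and `member` are hypothetical stand-ins for IS_TRUE, SUBSET and MEMBER.

```python
def in_class(value, cls):
    """A class is a half-open interval (lo, hi) or a set of categories (assumed encoding)."""
    if isinstance(cls, tuple):
        lo, hi = cls
        return lo <= value < hi
    return value in cls

def is_true(pred, vec):
    """IS_TRUE: every singleton predicate (X, Class) in pred holds for vec."""
    return all(in_class(vec[x], cls) for x, cls in pred)

def subset(pss, pred):
    """SUBSET(PSS, Pred): the pattern vectors of pss for which pred is TRUE."""
    return [pv for pv in pss if is_true(pred, pv)]

def member(x, pred):
    """MEMBER(X, Pred): TRUE if variable x appears in some singleton of pred."""
    return any(var == x for var, _ in pred)

# The predicate of Example 5 applied to a small, made-up historical data set.
pred = [("Size", (50, 200)), ("Function_type", frozenset({"computation"}))]
pvs = [
    {"Y": 6, "Size": 100, "Function_type": "computation"},
    {"Y": 0, "Size": 30, "Function_type": "computation"},
    {"Y": 2, "Size": 150, "Function_type": "io"},
]
print(subset(pvs, pred))
```

An empty predicate plays the role of TRUE here, so `subset(pvs, [])` returns the whole set, matching the convention that PSS is equivalent to SUBSET(PSS, TRUE).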

2.1.2 Optimal Subsets of Experiences 

In this section, we rigorously define the notion of "optimal subset of experiences" by defining the
function OPT that extracts these subsets from a given historical data set. We will see in the next section 
that OPT is not directly implementable. Nonetheless, this definition should help the reader understand 
our goals at a first glance. These definitions, by their very nature are somewhat terse. However, the 
accompanying explanations should help the reader get an intuitive understanding of the process. 

• Definition 1: Normalized Entropy H(PSS, Y)

This is the information theory definition of entropy that characterizes distributions, normalized to yield a
value between 0 and 1. This concept is commonly used in machine learning [M83] in order to assess the
level of information provided by a distribution on a continuous or discrete range. It yields a value of 0
when unambiguous information is provided and 1 when no information is provided.

\[ H(PSS, Y) = - \sum_{ClassY_j \in C} p(PSS, ClassY_j) \, \log_{|C|} p(PSS, ClassY_j) \]

where,

. PSS is a set of pattern vectors
. C is the set of classes defined on dom(Y), and ClassY_j is one of these classes
. p(PSS, ClassY_j) is the prior probability that a vector which is an element of PSS has a dependent
variable value belonging to the dependent variable class ClassY_j
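
As a rough illustration of Definition 1, the following Python sketch computes the normalized entropy from the list of Y classes observed in a pattern vector set. Taking the number of classes as the logarithm base is what keeps the value between 0 and 1; the function name and the list-of-labels encoding are assumptions made for this example.

```python
import math
from collections import Counter

def normalized_entropy(y_classes, n_classes):
    """H(PSS, Y): entropy of the observed Y-class distribution, log base n_classes.

    y_classes: the Y class of each pattern vector in PSS (assumed encoding).
    n_classes: the number of classes defined on dom(Y).
    """
    n = len(y_classes)
    if n == 0 or n_classes < 2:
        return 0.0
    h = 0.0
    for count in Counter(y_classes).values():
        p = count / n
        h -= p * math.log(p, n_classes)
    return h

# A pure subset carries unambiguous information; an evenly split one carries none.
print(normalized_entropy(["high"] * 4, 2))
print(normalized_entropy(["low", "high", "low", "high"], 2))
```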

• Definition 2: DIFFDIST(PSS_i, PSS_j, Y)

DIFFDIST(PSS_i, PSS_j, Y) = TRUE if the two sets of pattern vectors PSS_i and PSS_j
show a statistically significant DIFFerence in DISTribution on the dependent variable (Y) range, and
FALSE otherwise. This function is based on binomial tests for proportions and is described in more detail in
[BBT92]. The statistical level of significance used as a threshold between TRUE and FALSE is
subjective and is therefore defined by the user (e.g., 0.05, 0.1).

• Definition 3: VALID(PSS, mv) 

This function yields TRUE if at least one predicate is TRUE for all the pattern vectors in PSS and for the 
measurement vector mv. 

\[ PSS \subseteq PVS \wedge mv \in MV \wedge \exists Pred \ \text{such that}\ (\forall pv \in PSS,\ IS\_TRUE(Pred, pv) \wedge IS\_TRUE(Pred, mv)) \Rightarrow VALID(PSS, mv) \]

• Definition 4: EMIN(PSS, PSS_i, Y)

EMIN(PSS, PSS_i, Y) = TRUE if PSS_i, a subset of PSS, shows a significantly different distribution
from PSS on the Y range (based on a predefined level of significance and according to the result of the
function DIFFDIST) and, for all other subsets PSS_k of PSS showing a statistically significant Y
distribution, H(PSS_i, Y) ≤ H(PSS_k, Y). EMIN stands for: Entropy is MINimum. In other words,
EMIN tells us whether PSS_i characterizes a subset with minimal possible entropy and whether this low
entropy is not likely to be due to chance.

\[ PSS \subseteq PVS \wedge PSS_i \subseteq PSS \wedge \big( DIFFDIST(PSS_i, PSS, Y) \wedge ( \forall PSS_k \subseteq PSS,\ k \neq i,\ DIFFDIST(PSS_k, PSS, Y) \Rightarrow H(PSS_i, Y) \leq H(PSS_k, Y) ) \big) \Rightarrow EMIN(PSS, PSS_i, Y) \]

• Definition 5: OPT(PVS, mv, Y) 

OPT yields a set of OPTimal subsets of pattern vectors of PVS (the historical data set) based on the 
definitions presented above. These subsets are characterized by predicates which are built based upon 
known information (i.e., mv) and show a minimal entropy. They can therefore be used for predicting 
the value of Y with respect to mv. 

Example 6: In Figure 1, based upon a given measurement vector (mv ) and a given historical 
dataset, the optimal subset extracted by OPT and characterized by the predicate on the left hand 
side of Figure 1 indicates a strong probability for Y to lie in the interval Y2. This may be used 
for predicting the class where the object described by mv is likely to lie. Also, if Y is defined on 
a continuous scale, the optimal subset expected value may be used as a prediction. 

Figure 1: Classification with Extracted Subsets


Based on the primitives defined above, OPT may be defined as follows: 

\[ OPT(PVS, mv, Y) = \{\, PSS \mid PSS \subseteq PVS \wedge VALID(PSS, mv) \wedge EMIN(PVS, PSS, Y) \,\} \]

The function OPT as defined above defines optimal subsets of experiences with minimal entropies and 
characterized by optimal predicates. However, this is just a first step in the definition of an optimal 
search algorithm to extract datasets’ patterns since there are several reasons why this simple function is 
not fully adequate to build empirical models to fulfill our needs. Some of these reasons are simply 
computational in nature while others are related to the loss of useful information. 

• 1: The number of possible singleton predicate combinations makes the execution time of the search
for optimal predicates prohibitive without a search strategy.

• 2: We are interested not only in the optimal subsets extracted by OPT but also in the predicates that
characterize them. We want each generated predicate to contain only singleton predicates that
have a significant impact on the resulting distribution entropy (see Figure 1). Thus, we can
minimize the information necessary to identify optimal subsets and make the predicates more
interpretable.

• 3: We need to extract information about the relative impact of the various singleton predicates
within the optimal predicates.

• 4: The conditions under which singleton predicates or predicates appear relevant have to be
determined.


Therefore, we will now define an algorithm which addresses these issues, discussing its relationship to 
the function OPT. This is the Optimized Set Reduction process which can roughly be described by a 
three step recursive algorithm where entropy is optimized in a stepwise manner. 

2.2 The OSR Algorithm 

The goal of the OSR algorithm is to produce a set of patterns which characterize the trends observable in 
the historical data set while addressing the four modeling issues mentioned above. In this context, the 
notion of pattern is based upon the notion of predicate as defined above while addressing some of the 
mentioned modeling needs. This definition of pattern is intended to be both useful for prediction and
suitable for interpretation.

In subsection 2.2.2, we shall describe the OSR algorithm in detail. However, before doing so, we need
to define a number of preliminary concepts that are used in the algorithm. 

2.2.1 Preliminary Definitions 

• Definition 6: OSR Pattern 

As mentioned above, OSR generates patterns. A pattern is an ordered conjunction of predicates which
characterizes a subset of PVS that shows a minimal entropy distribution. The notion of ordering will be
represented by the "ORDERED AND" symbol (*). It is logically equivalent to the symbol (∧) with
the exception that predicates to the right of a * symbol are relevant only when all predicates to the left of
the symbol are already TRUE. The notion of order is introduced here to capture information about the
conditions under which a predicate is relevant; it does not have any logical impact on the
characterization of optimal subsets. We will call the ordered expression to the left of a given predicate in a
pattern the context of the predicate. This addresses issue number 4 mentioned above.


Example 7: Define two predicates

Pred_1 = SUBSYSTEM ∈ REAL-TIME CONTROL ∧ SUBSYSTEM ∈ LARGE
Pred_2 = #GLOBAL VARIABLES ∈ LARGE.

If we assume the pattern Pred_1 * Pred_2 was generated by OSR, we can see that this pattern
characterizes a pattern vector set suggesting a high risk which is defined, in this particular
example, as the probability of detecting errors that are difficult to correct during the test phases
(see Figure 2).

This pattern (Pred_1 * Pred_2) has a specific interpretation associated with it. Pred_1 is a non-singleton
predicate and Pred_2 is relevant within the context of Pred_1. This pattern implies the
following interpretation. If a subsystem is both large and real time, then it is significantly more 
likely to be of high risk than a random subsystem. However, it does NOT suggest that either real 
time subsystems or large subsystems independently increase the probability that a subsystem will 
be of high risk. Also, within the context of large, real time subsystems, subsystems with a large 
number of global variables have a significantly greater probability of being high risk than those 
with a small number of global variables. However, this pattern does NOT suggest that a large 
number of global variables has a significant impact on the probability that a subsystem will be of 
high risk outside the context of large, real time subsystems. (More details concerning pattern 
generation and interpretation will be presented later in the paper.) 

Figure 2: Classification Using Patterns

• Definition 7: DISCRETIZE(PSS, X_i)

Given a particular subset of pattern vectors (PSS), we want to divide/cluster the ranges/categories of the
explanatory variables into an exhaustive and disjoint set of classes (Class_i1, ..., Class_ic for the
explanatory variable X_i) based on a meaningful class creation technique. This is used both to define
singleton predicates and to better address the problem of heteroscedasticity, i.e., requirement R5 of
[BBT92], which states that an explanatory variable may be a good predictor on a part of its range/value
domain while a mediocre predictor otherwise. Clustering of discrete categories can only be performed by
the user by defining taxonomies. Numerous techniques are available in the literature to create intervals
on continuous / ordinal ranges (e.g., cluster analysis) [DG84]. However, none appear to have
satisfactory properties for our problem. Therefore, classes are created for continuous / ordinal
explanatory variables according to the procedure DISCRETIZE briefly presented below and described in
Appendix II.

DISCRETIZE(PSS, X_i) defines classes on the range of X_i (a particular continuous or ordinal
explanatory variable) based on a pattern vector subset PSS. This algorithm has the following properties:


• Either all or some of the classes should show distributions on the Y range that are significantly
different from the distribution resulting from the union of those classes. If not, differentiating these
classes and creating new pattern vector subsets is meaningless.

• The algorithm handles monotonic and non-monotonic underlying distributions on the Y range.

• The algorithm is not oversensitive to the addition or deletion of a few pattern vectors, so stable
patterns are generated.


Our goal is to take into account the above constraints and to minimize the average entropy across the 
created classes in order to have classes as homogeneous as possible with respect to the dependent 
variable values of their pattern vectors. Figure 3 illustrates the output of the algorithm. We assume an 
actual underlying and unknown non-monotonic probability density function and an observed sequence 
of Y values on the explanatory variable X range. We also assume two classes (1, 2) are defined on the Y
value domain. Using the DISCRETIZE algorithm produces Boundary1 and Boundary2 in Figure 3,
which create the corresponding set of three explanatory variable value classes across the X range.

Figure 3: Discretization Process

• Definition 8: GENERATE_SINGLETONS(PSS, mv)

Let PSS represent the considered pattern vector set and let mv be a measurement vector. The classes
defined by DISCRETIZE for each explanatory variable X_i give us a set of singleton predicates: {X_i ∈
Class_i1, ..., X_i ∈ Class_ic}. GENERATE_SINGLETONS(PSS, mv) generates the set of all singleton
predicates SP such that SP = {Pred_i | IS_TRUE(Pred_i, mv)}.
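
A minimal Python sketch of this step is given below. It assumes that the classes have already been produced (the real procedure obtains them from DISCRETIZE on the current PSS, which is not reproduced here), that every class is a half-open interval, and that `generate_singletons` and the sample boundaries are purely illustrative.

```python
INF = float("inf")

def generate_singletons(classes, mv):
    """Return the singleton predicates (variable, (lo, hi)) that are TRUE for mv.

    classes: {variable: [(lo, hi), ...]}, assumed to be the output of a prior
             discretization step (hypothetical boundaries in the example below).
    mv:      measurement vector as {variable: value}.
    """
    sp = []
    for var, intervals in classes.items():
        for lo, hi in intervals:
            if var in mv and lo <= mv[var] < hi:
                sp.append((var, (lo, hi)))
    return sp

classes = {
    "Size": [(0, 50), (50, 200), (200, INF)],
    "Nesting": [(0, 3), (3, INF)],
}
mv = {"Size": 100, "Nesting": 4}
sp = generate_singletons(classes, mv)
print(sp)
```

Because the classes of each variable are disjoint and exhaustive, at most one singleton predicate per explanatory variable is kept for a given mv.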

• Definition 9: SIG_PREDICATE(PSS, Pred, Y) 

The predicate Pred is said to be significant for the data set PSS if SUBSET(PSS, Pred) shows an 
entropy lower than the one of PSS and if their distributions on the Y range show statistically significant 
differences. 

\[ PSS \subseteq PVS \wedge (H(SUBSET(PSS, Pred), Y) < H(PSS, Y) \wedge DIFFDIST(PSS, SUBSET(PSS, Pred), Y)) \Rightarrow SIG\_PREDICATE(PSS, Pred, Y) \]

Example 8: Assuming two dependent variable classes ([low, high]), suppose Pred 
characterizes a subset whose distribution across the two classes is [10, 7]. This subset shows an
entropy which is lower than the entropy of PSS, which had a distribution [100, 75], but the 
difference is not statistically significant since the proportion of pattern vectors in each class is 
practically the same. A binomial test for proportions [Cap88] is used to assess the significance of 
the observed difference in entropy. 
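
The situation in Example 8 can be checked numerically with the sketch below. The exact form of the test is defined in [BBT92] and is not reproduced here; as a stand-in, this example assumes an exact two-sided binomial test of the subset's proportion in the first class against the proportion observed in PSS. The function names and the default significance level are illustrative.

```python
import math

def entropy(counts):
    """Normalized entropy of a class-count distribution (log base = number of classes)."""
    n = sum(counts)
    k = len(counts)
    return -sum((c / n) * math.log(c / n, k) for c in counts if c > 0)

def binom_p_value(k, n, p0):
    """Exact two-sided binomial p-value: total probability of the outcomes
    that are no more likely than the observed count k under proportion p0."""
    pmf = [math.comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(n + 1)]
    return min(1.0, sum(q for q in pmf if q <= pmf[k] * (1 + 1e-9)))

def sig_predicate(parent, sub, alpha=0.05):
    """parent, sub: [low, high] counts for PSS and SUBSET(PSS, Pred).

    Assumption: the distributions are compared through the proportion of
    'low' vectors, using the binomial test sketched above.
    """
    p0 = parent[0] / sum(parent)
    differs = binom_p_value(sub[0], sum(sub), p0) < alpha
    return entropy(sub) < entropy(parent) and differs

parent, sub = [100, 75], [10, 7]
print(entropy(parent), entropy(sub))  # the subset's entropy is slightly lower
print(sig_predicate(parent, sub))     # yet the predicate is not significant
```

A subset with a clearly different split, such as [2, 15], passes both conditions under the same assumptions.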

• Definition 10: MINIMAL(PSS, Pred_i, Y)

The predicate Pred_i is said to be minimal for the pattern vector set PSS if it characterizes a subset of
PSS which shows a significantly different distribution across the Y classes and there exists no other
predicate Pred_j, with Pred_i ⇒ Pred_j, such that Pred_j characterizes a subset of PSS which shows a
significantly different distribution across the Y classes. Otherwise, Pred_i contains more singleton
predicates than is necessary to significantly improve the entropy and is not considered to be minimal.

\[ SIG\_PREDICATE(PSS, Pred_i, Y) \wedge (\forall j,\ Pred_i \Rightarrow Pred_j,\ j \neq i,\ \neg SIG\_PREDICATE(PSS, Pred_j, Y)) \Rightarrow MINIMAL(PSS, Pred_i, Y) \]




Example 9: Assume that the predicate Pred_1 = SUBSYSTEM ∈ REAL-TIME CONTROL ∧
SUBSYSTEM ∈ LARGE yields, in a defined context, an entropy of 0.5 (assumed to yield a
significantly different distribution from the parent set). If Pred_2 = SUBSYSTEM ∈ REAL-TIME
CONTROL by itself yields an entropy of 0.5, Pred_1 is not minimal.
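
The minimality check can be sketched as a search over the proper sub-conjunctions of a predicate. In the Python illustration below, a predicate is a frozenset of singleton predicates and `significant` is a caller-supplied function standing in for SIG_PREDICATE on a fixed PSS and Y; both choices, and the lookup table used to mimic Example 9, are assumptions of this example.

```python
from itertools import combinations

def is_minimal(pred, significant):
    """MINIMAL: pred is significant and no proper, non-empty sub-conjunction is.

    pred:        frozenset of singleton predicates.
    significant: callable(predicate) -> bool (stand-in for SIG_PREDICATE).
    """
    if not significant(pred):
        return False
    for r in range(1, len(pred)):
        for sub in combinations(pred, r):
            if significant(frozenset(sub)):
                return False
    return True

# Example 9, with significance given by a lookup table instead of a real test.
RT = ("SUBSYSTEM", "REAL-TIME CONTROL")
LARGE = ("SUBSYSTEM", "LARGE")
known_significant = {frozenset({RT}), frozenset({RT, LARGE})}

def significant(pred):
    return frozenset(pred) in known_significant

pred1 = frozenset({RT, LARGE})
pred2 = frozenset({RT})
print(is_minimal(pred1, significant), is_minimal(pred2, significant))
```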

• Definition 11: VALID_PREDICATES(PSS, PRED_C, SP, Y)

Let PSS represent a set of pattern vectors and PRED_C be a set of predicates which define the context
characterizing PSS. Let SP be a set of singleton predicates and Y be the dependent variable.

Assuming that the set SP has been created by using GENERATE_SINGLETONS, we generate the set
of all predicates which are conjunctions of the singletons in SP and which are minimal with respect to PSS
(as defined above), as long as they do not use any explanatory variable X that appears in PRED_C. These
predicates are called valid and are the ones that appear potentially useful for extracting subsets of PSS
with high predictive power for mv on the Y range. With respect to the implementation of this procedure,
the user may restrict the search space by fixing a maximum number of singleton predicates per predicate.
However, some complex but meaningful predicates may not be extracted by doing so.

\[ VALID\_PREDICATES(PSS, PRED_C, SP, Y) = \{\, Pred_i \mid MINIMAL(PSS, Pred_i, Y) \wedge Pred_i \subseteq SP \wedge (\forall j,\ Pred_j \in PRED_C,\ \forall X \in \{ X_k \mid X_k \in EV \wedge MEMBER(X_k, Pred_j) \},\ \neg MEMBER(X, Pred_i)) \,\} \]


• Definition 12: EXTRACT_SUBSETS(PSS, PRED)

Let PRED be a set of predicates. A set of subsets, where each subset is characterized by one and only
one predicate in the set PRED, is extracted from PSS.

\[ EXTRACT\_SUBSETS(PSS, PRED) = \{\, PSS_i \mid Pred_i \in PRED \wedge PSS_i = SUBSET(PSS, Pred_i) \,\} \]

2.2.2 The Algorithm


When the dependent variable's value domain is defined on a continuous scale, its range is assumed to be
divided into intervals / classes. These classes are fixed and will be used throughout the algorithm. These
intervals are usually defined according to two main criteria: the size of the data set and the specific use of
the model. The larger the data set, the narrower the classes may be so that the model can produce a more
accurate response. The definition of these classes must also take into account the future use of the
predictions (e.g., the classes may represent clusters on the Y range or a finite number of situations suggesting alternatives).


Example 10: 

Assume that the range of the dependent variable (Y) is an integer range from 0 to 5, indicating 
the number of fault reports that were generated for a component during system and acceptance 
test. Then, we may decide to define the following dependent variable classes: 

ClassY1 = Y in [0, 1) Low Risk Components
ClassY2 = Y in [1, +∞) High Risk Components


Let PSS be a set of pattern vectors, let mv be a measurement vector characterizing the object to be
classified on the Y range, and let PRED_C be the set of predicates composing the pattern characterizing the
set PSS. Recall that we cannot use OPT directly. However, OSR(PSS, mv, PRED_C, Y) heuristically
returns a set of "optimal" subsets using the algorithm defined below.

OSR(PSS, mv, PRED_C, Y)

• Step 1: SP = GENERATE_SINGLETONS(PSS, mv)

/* Generate a set of optimal singleton predicates based on the pattern vector set PSS */
/* and for the measurement vector mv */

• Step 2: PRED = VALID_PREDICATES(PSS, PRED_C, SP, Y)

/* Generate all the valid predicates based on the available set of singleton predicates */
/* SP, the current context defined by PRED_C, and its corresponding pattern vector set PSS. */

• Step 3:

if PRED = ∅ /* no predicates have been created at Step 2 */
    return PSS;
else
{
    /* A subset is extracted for each valid predicate created at Step 2 */
    /* OSR is called recursively for each of these extracted subsets */
    for all PSS_i ∈ EXTRACT_SUBSETS(PSS, PRED) do
    {
        /* the context of PSS_i is now the context of PSS union Pred_i */
        PRED_i = PRED_C ∪ Pred_i;

        /* call OSR for the subset PSS_i */
        OSR(PSS_i, mv, PRED_i, Y);
    }
}

Initially, call OSR(PVS, mv, ∅, Y) where PVS is the historical data set.
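
The three-step recursion can be made concrete with the following Python sketch. It is a deliberately simplified illustration, not the authors' implementation: the classes are fixed in advance instead of being recomputed by DISCRETIZE at each level, predicates are limited to a single singleton, and the significance and minimality tests are replaced by a hypothetical rule (a minimum subset size and a minimum entropy reduction, `min_size` and `min_gain`). The data set is made up.

```python
import math
from collections import Counter

INF = float("inf")

def entropy(pss):
    """Entropy (base 2, hence in [0, 1] for two Y classes) of the Y distribution."""
    n = len(pss)
    counts = Counter(pv["Y"] for pv in pss)
    return -sum((c / n) * math.log2(c / n) for c in counts.values()) if n else 0.0

def holds(singleton, vec):
    """TRUE if vec's value for the variable lies in the half-open interval."""
    var, (lo, hi) = singleton
    return lo <= vec[var] < hi

def osr(pss, mv, context, classes, min_gain=0.1, min_size=2):
    """Return a list of (pattern, leaf subset) pairs for measurement vector mv."""
    used = {var for var, _ in context}
    # Step 1: singleton predicates that are TRUE for mv
    sp = [(v, c) for v, cs in classes.items() for c in cs if holds((v, c), mv)]
    # Step 2: "valid" predicates (stand-in rule, see the assumptions above)
    valid = []
    for s in sp:
        if s[0] in used:          # variable already appears in the context
            continue
        sub = [pv for pv in pss if holds(s, pv)]
        if len(sub) >= min_size and entropy(sub) <= entropy(pss) - min_gain:
            valid.append((s, sub))
    # Step 3: stop, or recurse on each extracted subset with an extended context
    if not valid:
        return [(context, pss)]
    leaves = []
    for s, sub in valid:
        leaves.extend(osr(sub, mv, context + [s], classes, min_gain, min_size))
    return leaves

classes = {"Imports": [(0, 10), (10, INF)], "Nesting": [(0, 3), (3, INF)]}
rows = [(2, 1, "low"), (3, 4, "low"), (5, 2, "low"), (4, 5, "low"), (12, 1, "low"),
        (15, 2, "high"), (20, 4, "high"), (11, 5, "high"), (30, 6, "high"), (14, 3, "high")]
pvs = [{"Imports": i, "Nesting": n, "Y": y} for i, n, y in rows]
mv = {"Imports": 25, "Nesting": 4}

leaves = osr(pvs, mv, [], classes)
for pattern, leaf in leaves:
    print(pattern, Counter(pv["Y"] for pv in leaf))
```

Each returned pair holds the ordered list of predicates along one recursive path together with the leaf subset it characterizes. Because a variable already present in the context is never reused, the recursion depth is bounded by the number of explanatory variables.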

The OSR algorithm can be viewed as a recursive function of OPT as described below. PVS is the 
historical data set and mv the vector describing the object to be assessed. Let us assume we modify the 
definition of the function VALID, which is used to build OPT, so that the function MINIMAL is 
included in it. Then, VALID becomes the following: 

\[ PSS \subseteq PVS \wedge mv \in MV \wedge \exists Pred_i \ \text{such that}\ (\forall pv \in PSS,\ IS\_TRUE(Pred_i, pv) \wedge IS\_TRUE(Pred_i, mv) \wedge MINIMAL(PSS, Pred_i, Y)) \Rightarrow VALID(PSS, mv, Y) \]

Then, assuming the definition of OPT uses this new definition of VALID, we can then define OSR in 
the following way: 


\[ OSR(PSS, mv, PRED_C, Y) = \begin{cases} \bigcup_{PSS_i \in OPT(PSS, mv, Y)} OSR(PSS_i, mv, PRED_C \cup Pred_i, Y), & \text{if } OPT(PSS, mv, Y) \neq \emptyset \\ \{PSS\}, & \text{otherwise.} \end{cases} \]


Note that at each level of recursion, a minimal subset of pattern vectors is extracted. These recursively 
nested, extracted subsets are each characterized by a predicate in a context. Thus, if we implicitly order 
the paths, the ordered conjunction of predicates along each recursive path is a pattern (see Definition 6
and Figure 4).


The subsets of PVS extracted by OSR for a particular mv may be used for the classification of Y for mv.
Also, if patterns are extracted for each mv in PVS, the resulting set of patterns may be used for the
interpretation of the impact of the explanatory variables on the dependent variable in a particular
development environment. These issues will be addressed in the next sections.

Example 11 

In Figure 4, we can see how OSR patterns are generated during the subset extraction process. At 
the first (highest) level in the hierarchy, suppose <NUM_IMPORTS = HIGH> is a predicate 
which is minimal, causing the extraction of Subset 1. At the second level, suppose the two-place 
predicate <NESTING = HIGH ∧ CMPLX = HIGH> was found to be minimal. Then, by tracing 
the hierarchy down this particular path, OSR generates the following pattern, which corresponds 
to the extracted Subset 1.1: 


NUM_IMPORTS = HIGH ∧ (NESTING = HIGH ∧ CMPLX = HIGH) 

Also, each path in the hierarchy from the top set (PVS) to a bottom level subset is marked by its 
own pattern. Thus, OSR creates a set of patterns (i.e., all the paths in the hierarchy). 


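[Figure 4: a hierarchy rooted at the historical data set (PVS), with extracted subsets Subset1.1, 
Subset1.2, Subset2.1 and Subset2.2 connected by "subset of" relationships. The branch leading to 
Subset1.1 is labeled NUM_IMPORTS = High, then NESTING = High ∧ CMPLX = High.] 

Figure 4: An Example of OSR Hierarchy 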

Each path of the hierarchy represents a path that the extraction process may have taken during OSR. 
Accordingly, each path is characterized by an ordered conjunction of predicates, i.e., a pattern. Each 
final extracted subset (i.e., leaves of the hierarchy) forms a probability distribution across the dependent 
variable range. This distribution is a valuable piece of information and can be used in several ways. For 
instance, if the dependent variable is discrete, the dependent variable class containing the largest number 
of pattern vectors may be selected as the most likely class for the new object's Y value to lie in. 
Alternatively, we may consider using a Bayesian approach. That is, we could define a loss/risk function 
[BBT92] and select the dependent variable class yielding the minimum expected loss. Finally, note that 
several leaves may have distributions that yield contradictory or dissimilar trends. Therefore, several 
pattern classifications (i.e., hierarchy leaves) are used to make a final global classification based on 
predefined decision rules. In order to perform such decisions effectively, we need to be able to evaluate 
the accuracy of the identified patterns (e.g., hierarchy branches). This is the topic of the next subsection. 
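
The two ways of using a leaf distribution described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the class labels and the loss values are hypothetical, and the loss table is only one possible form of the loss/risk function mentioned in the text.

```python
from collections import Counter


def most_likely_class(leaf_y_values):
    """Class containing the largest number of pattern vectors in the leaf."""
    return Counter(leaf_y_values).most_common(1)[0][0]


def min_expected_loss_class(leaf_y_values, loss):
    """Class with minimum expected loss; loss[(predicted, actual)] is a cost."""
    counts = Counter(leaf_y_values)
    total = sum(counts.values())
    candidates = sorted({predicted for predicted, _ in loss})

    def expected_loss(predicted):
        # Weight each possible actual class by its proportion in the leaf.
        return sum(loss[(predicted, actual)] * n / total
                   for actual, n in counts.items())

    return min(candidates, key=expected_loss)


# Hypothetical leaf: 6 low risk and 4 high risk pattern vectors.
leaf = ["low"] * 6 + ["high"] * 4
# Hypothetical losses: missing a high risk component costs five times more
# than flagging a low risk one unnecessarily.
loss = {("high", "high"): 0, ("high", "low"): 1,
        ("low", "high"): 5, ("low", "low"): 0}

print(most_likely_class(leaf))
print(min_expected_loss_class(leaf, loss))
```

With these illustrative costs the two rules disagree, which shows why the choice of decision rule matters when the consequences of the two kinds of misclassification differ.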

2.3 Assessing the Accuracy of Patterns 

In order to generate patterns and assess their accuracy, we use OSR in the context of the technique called 
V-fold Cross Validation [BF+84]. For each pattern vector pv in the historical data set, we can run the 
OSR algorithm using PVS - {pv} as the initial data set and using the measurement vector composing pv 
as mv. The pattern vector pv is removed from the data set in order to avoid any bias in the results. Thus, 
each time we run OSR, we know the actual value of the dependent variable we are trying to classify. 
This allows us to not only extract specific patterns for each pattern vector in the data set, but we are also 
able to classify each generated pattern as right or wrong at the time it is generated. The set of patterns 
generated through this iterative process forms a representation of the trends observable on this particular 
data set which we will call a Specific Pattern Set (SPS). 

The SPS may be viewed as a hierarchical model (see Figure 4) of the historical data set. Many of the 
patterns in the SPS will be the same or similar and will therefore form classes of patterns. For each of 
these classes, based on the SPS, we can evaluate statistics such as pattern reliability (i.e., percentage of 
correct classification when the pattern is used) and pattern reliability significance (i.e., the probability 
of obtaining by chance, through a random classification, a reliability at least as high as the one 
observed). These statistics can then be used to evaluate the pattern based predictions as explained in the 
subsequent paragraphs. Thus, even though incomplete / partial information is available in the historical 
data set, accurate patterns may still be generated in some cases. 

Recall that we assumed the patterns generated by OSR have the following ordered conjunctive normal 
form: 


Predicate1 ∧ Predicate2 ∧ ... ∧ PredicateN 

Also, recall that the order in which the predicates appear matters, since it determines the contexts in 
which they are relevant. A predicate is relevant only when the conditions defined by its preceding / parent 
predicates (i.e., the context of a predicate) are true. 

Let ClassY_i be dependent variable class i. Let T be the number of generated instances of Pattern_j 
that predict ClassY_i. Let C be the number of these pattern instances which correctly predict ClassY_i (based on 
the actual Y value of the pattern vector for which the pattern was produced). 

Then we define the reliability of Pattern_j with respect to the dependent variable class ClassY_i as: 

R[ClassY_i ; Pattern_j] = C / T 

The probability that a pattern appears T times yielding a particular classification ClassY_i C times 
correctly by chance (P(C,T,p)) can be expressed by the binomial distribution: 

$$P(C, T, p) = \frac{T!}{C!\,(T-C)!}\; p^{C} (1-p)^{T-C}$$

where p = p(ClassY_i), i.e., the prior probability that the value of the dependent variable is in ClassY_i. 

If the pattern reliability R is equal to 1.0, then the binomial equation can be simplified and the level of 
significance is simply p^T. If R is below one, then the pattern reliability significance RS can be calculated 
using the following formula: 

$$RS = \sum_{j=0}^{T-C} P(C + j, T, p)$$

Example 12: For a given pattern, suppose that: 

C = 10 (the number of times that the pattern was correct during the 
V-fold Cross Validation) 

T = 12 (the total number of times the pattern was generated) 

Also, suppose that there are exactly two dependent variable classes and a uniform distribution 
in the historical data set, so that the prior probability is 0.5 for 
each dependent variable class. 

Then, using the above formulas, this pattern has the following reliability and reliability 
significance. 

R = 0.83 

RS = 0.019 
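
The figures in this example can be checked with a few lines of Python. The function names below are illustrative, not taken from the OSR tool; the computation follows the binomial formula and the summation for RS given above.

```python
from math import comb


def binomial_probability(c, t, p):
    """P(C, T, p): probability of exactly c correct classifications out of t."""
    return comb(t, c) * p ** c * (1 - p) ** (t - c)


def reliability(c, t):
    """R = C / T."""
    return c / t


def reliability_significance(c, t, p):
    """RS: probability of at least c correct classifications out of t by chance."""
    return sum(binomial_probability(c + j, t, p) for j in range(t - c + 1))


r = reliability(10, 12)
rs = reliability_significance(10, 12, 0.5)
print(f"R = {r:.2f}, RS = {rs:.3f}")
```

When C = T the sum reduces to a single term equal to p^T, which matches the simplified case stated for R = 1.0.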

Since we are able to differentiate significantly reliable patterns from the non-significant and/or unreliable 
ones, we are able to know the reliability of a classification when we make it. That is, when we are trying 
to assess a new object, we run the OSR algorithm using that object as the measurement vector. This 
process extracts a set of patterns specific to that object. Then, when making a classification for this 
object, we know that a classification based on a reliable pattern with a sufficient level of significance 
(e.g., RS < 0.05) is believable, whereas one based on a reliable pattern with a poor level of significance 
is not. 

Thus, the decision process is based on the R's and RS's of each pattern in the hierarchy. Pattern 
reliability is used for classification while the variations in pattern entropy are used for interpretation. 
Although a reliable pattern always shows a low entropy, the opposite is not true (for reasons beyond the 
scope of this paper). 

Note: a poor reliability means that a pattern is not robust to "noise" (i.e., the dependent variable 
variations created by non-measured phenomena). A poor reliability significance may mean that the 
pattern is a result of noise or more complex phenomena resulting from the OSR process (again beyond 
the scope of this paper). 

2.4 Support for Interpreting Patterns 

As we have seen, patterns are useful for classifying variables of interest. However, more importantly, 
they are also useful in providing understandable / interpretable models. Patterns are much easier to 
interpret than regression coefficients. First of all, OSR takes into account interactions between 
explanatory variables, i.e., the fact that an explanatory variable can have a strong impact in a certain 
context and not be relevant in another one. These interactions do not have to be known before building 
the model as opposed to interaction terms in logistic regression [HL89]. Secondly, as we will see, a 
process (described below) can be defined to show strong associations that exist in a given context (this 
is needed to satisfy R4 of [BBT92]). Finally, the variation in entropy generated by a particular predicate 
can help assess the significance of the impact of an explanatory variable (on the dependent variable) 
within a certain context. However, interpreting the raw patterns would force the user to deal with useless 
complexity. Many of these patterns are similar and should not be differentiated. This can prevent the 
user from getting a clear picture of the model trends. Therefore, the patterns generated by the OSR 
process need to be grouped in order to make them more easily understandable and interpretable. This can 
be done using a formally defined statistical process (described below) where the user fixes the desired 
level of "similarity" between patterns by assigning values to a small set of parameters. 

Let us define two patterns PT1 and PT2: 


PT1: Pred_i ∧ Pred_j 
PT2: Pred_i ∧ Pred_k 

Suppose in the context where Predi is true, the pattern vector subset for which Predj is true happens to 
show a strong association with the one for which Predk is true. This implies that these predicates 
capture basically the same phenomenon. The strength of the association can be assessed by using a 
normalized Chi-squared based statistic such as Pearson's Phi [CA88]. A Chi-squared test can be 
performed to assess the statistical level of significance of such an association. The two patterns will then be 
merged into one, signifying that the selection of one predicate, or the other, during the OSR process, 
occurred randomly. This is a result of slight differences between the two predicates and therefore 
distinguishing between them does not help in the understanding of the object of study. This 
phenomenon is mainly due to complex interdependencies between Xs that are often underlying the 
software engineering data sets. 

In order to decide whether or not two strongly associated predicates should be differentiated, the 
user declares a Phi value which represents the minimal degree of association necessary to assume two 
predicates as similar. This process of merging patterns based on the similar predicates principle yields 
the resulting pattern PT{1,2} which contains the composite predicate (Pred_j ∨ Pred_k), implicitly 
meaning that its two component predicates are interchangeable in this context. 


PT{1,2}: Pred_i ∧ (Pred_j ∨ Pred_k) 

Let us define a composite predicate to simply be a disjunction of predicates. 
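
The association test behind this first merging principle can be sketched as follows. The sketch assumes each predicate has been evaluated on every pattern vector of the current context and is represented as a list of booleans; `phi_coefficient`, the sample flags and the threshold value used in the comparison are illustrative assumptions. It uses the standard 2x2 formula for Pearson's Phi, whose absolute value equals the square root of Chi-squared divided by the sample size; the exact variant used in [CA88] may differ.

```python
from math import sqrt


def phi_coefficient(flags_j, flags_k):
    """Pearson's Phi between two predicates evaluated on the same subset."""
    pairs = list(zip(flags_j, flags_k))
    a = sum(1 for x, y in pairs if x and y)          # both true
    b = sum(1 for x, y in pairs if x and not y)      # only Pred_j true
    c = sum(1 for x, y in pairs if not x and y)      # only Pred_k true
    d = sum(1 for x, y in pairs if not x and not y)  # both false
    denominator = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denominator if denominator else 0.0


# Hypothetical truth values of Pred_j and Pred_k on ten pattern vectors
# for which Pred_i is true.
pred_j = [True, True, True, True, False, False, False, False, False, False]
pred_k = [True, True, True, False, False, False, False, False, False, True]

phi = phi_coefficient(pred_j, pred_k)
phi_threshold = 0.7  # user-declared minimal degree of association (assumed value)
print(phi, phi >= phi_threshold)
```

Two predicates that are true on exactly the same pattern vectors yield a Phi of 1.0, so they would be merged under any threshold.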

Example 11: Assume that in the context of a subsystem whose focus is data processing, 
most of the components with a large number of SLOCs are also the ones with a large Halstead's 
volume V. PT1 and PT2 will be merged if the level of association between the two second 
position predicates (which are in this case singletons) is higher than the "Phi" threshold defined by 
the user. 


PT1:     SUBSYSTEM ∈ REAL-TIME CONTROL ∧ V ∈ LARGE, R = 0.90, RS = 0.06 
PT2:     SUBSYSTEM ∈ REAL-TIME CONTROL ∧ SLOC ∈ LARGE, R = 0.92, RS = 0.07 
PT{1,2}: SUBSYSTEM ∈ REAL-TIME CONTROL ∧ (V ∈ LARGE ∨ SLOC ∈ LARGE), R = 0.91, RS = 0.01 


In a situation where PT1 and PT2 are both reliable but show a small number of occurrences in 
the specific pattern set (see previous section), they will be associated with weak levels of 
significance (RS). Merging them will increase this level of significance and keep the reliability 
(R) constant if the Phi threshold used is high enough. 3 

Automated merging of similar patterns can be performed if the user provides either a Phi value or a level 
of significance that corresponds to an unambiguous definition of pattern similarity. 

In a similar manner, we can define a second merging principle. Suppose we have the same two patterns 
as defined above: 


PT1: Pred_i ∧ Pred_j 
PT2: Pred_i ∧ Pred_k 

However, this time suppose that Pred_j is the singleton predicate X_k ∈ Class_km and Pred_k is the 
singleton predicate X_k ∈ Class_kn, where Class_km is a neighbor class of Class_kn (their boundaries may 
overlap). In this particular case, if the two patterns characterize subsets with no statistically significant 
difference in distribution on the dependent variable range, then they can be merged. This is because the 
variation from one class to the other seems to have a non-relevant effect on the dependent variable under 
the context where Pred_i is true. Therefore, in order to assess if merging is possible, the probability that 
differences between distributions are random is calculated. For each dependent variable class, the 
proportions of pattern vectors are compared between the two distributions by calculating the probability 
that the difference in proportion is due to randomness. If, for all dependent variable classes, the resulting 
minimum probability is above a user-defined critical probability value, we accept the hypothesis that 
there is no significant difference between the two distributions. In the tool developed to support the OSR 
approach, this is calculated through a binomial test for proportions. 

Example 12: Assume that in the context of components with a large number of SLOCs and a 
large Halstead's volume V, the programmer's experience of the programming language (ordinal 
scale) is a significant factor. Both PT1 and PT2 show a first position predicate 
which is the result of a previous merging according to the first principle presented above. Their 
second position predicate is similar but not identical. PT1 and PT2 will be merged into PT{1,2} 
if the level of similarity between the two second position predicates (which are in this case 
singletons) is higher than the threshold defined by the user. 


PT1:     (V ∈ LARGE ∨ SLOC ∈ LARGE) ∧ EXPERIENCE ∈ [1,2) 

PT2:     (V ∈ LARGE ∨ SLOC ∈ LARGE) ∧ EXPERIENCE ∈ [2,3) 

PT{1,2}: (V ∈ LARGE ∨ SLOC ∈ LARGE) ∧ EXPERIENCE ∈ [1,3) 

Both of the merging principles defined above can be used simultaneously in order to obtain more 
significant and interpretable patterns. However, the merging process using both of them must be 
carefully defined. We have built a prototype tool where such mechanisms have been completely 
automated. A more precise definition of the pattern merging algorithm is presented in Appendix II. 

3 Validating the Approach 


In order to validate the OSR approach, we need to compare it to standard modeling processes that 
can be used for classification: logistic regression [HL89] and classification trees [S92]. 

Our definition of a high risk component (procedure or function) is: any software component where 
errors were detected during system and acceptance test. Low risk is used to identify the remaining 
components of the system. In particular, we wish to build models that identify high risk components for 
a particular category of errors: ones that characterize an incorrect reading or writing in a variable or a data 
structure. 


3.1 Data Description 

The data set was created using data collected from 146 components of a 260 KLOC Ada system. The 
data set contains an equal number of low and high risk components. This was 
done in order to construct unbiased classification models. We selected all the high risk components 
identified during test phases and randomly added an equivalent number of low risk components 
from among those available. A larger number of low risk components in the data would lead all modeling 
techniques to generate models more accurate for the low risk class and would therefore provide mediocre 
models for the high risk class (i.e., their results would not be representative of the actual capability of 
the models in terms of accurately identifying high risk components). 


The explanatory variables used to construct the models are static code and design metrics. Some of these 
metrics are taken from a project whose goals were to build multivariate models of software quality 
based on architectural characteristics of Ada designs [AES90,AE92,AE+92]. Others are well known 
component level complexity and size measures [BP84]. We will first summarize the architectural 
approach to measurement taken in this project and then define the assumptions upon which the analysis 
was conducted. 


The architectural view of the Ada system can be derived by identifying the major components of the 
system, and determining the relationships among them. The library unit aggregation (LUA), or the 
library unit and all its descendant secondary units [AES90], provides an interesting concept for an Ada 
system. Relationships between LUAs can include the importing relationship, or the relationship between 
an instantiation and its generic template. The increased use of Ada as a design as well as implementation 
language provides an opportunity to better assess the final product in its intermediate stages. Since the 
design and the final product are written in the same language we can use tools developed for analysis of 
Ada source code to provide an automated means for analyzing Ada designs. This automation is essential 
if one is to frequently measure and assess the design. 


The metrics used in this study are derived from the architecture of the system, and were obtained by an 
automated static analysis of the source code using the ASAP static analysis program [Dou87], UNIX 
utilities and the SAS statistical analysis system. At the heart of the measures are counts of declarations 
in an LUA - whether they are declarations made in the LUA, declarations imported to the LUA (i.e., 
declarations made in another LUA made visible by a “with” clause), declarations exported by the LUA 
(i.e., declarations made in the library unit, and visible to other units that import the LUA) or 
declarations hidden from these importing units (i.e., declarations made in the body and subunits). 


The collection of metrics was developed from hypotheses about the nature of the software design 
process and further details can be found in [AES90,AE92,AE+92]. These, in addition to other raw 
measures extracted from the source code, were used in this study. The metrics include ratios designed to 
indicate the extent of context coupling, visibility control, locality of imports, and parametrization. These 
characteristics are based on the following underlying assumptions: 



• Assumption 1 (Context coupling): Importing and/or exporting large numbers of declarations may 
require complex interfacing with the other LUAs of the system and is expected to be an 
error-prone factor. 

• Assumption 2 (Parametrization): The average number of parameters per program unit 
declaration in the LUA should have an impact on the probability of generating defects. The larger 
the parametrization of the LUA, the larger the number of abstractions to be dealt with, the greater 
the difficulty for a designer or a programmer to keep in memory their respective roles, and the more 
complex it becomes to handle interaction with other LUAs. 

• Assumption 3 (Visibility control): The ratio of cascaded imports (declarations imported to a unit 
and whose visibility cascades to its descendant units [AE+92]) to direct imports in the LUA 
captures the extent to which declarations are imported to where they are needed in 
the LUA. The larger the number of visible declarations unrelated to the problem addressed at a 
particular location in the LUA, the larger the risk of confusion or misunderstanding of those 
program abstractions. 

• Assumption 4 (Reuse): A high ratio of reused code in a LUA denotes familiarity with / 
understanding of the problem addressed and the computer-based solution, i.e., the LUA 
interface with other LUAs, its component interfaces and its data structures. This is expected to 
lower the probability of defect. 

In addition to the architectural metrics mentioned above, two main categories of component complexity 
metrics may be identified as well: size of the component and the structural or control flow complexity of 
the component. 


• Assumption 5 (component size): Different measures of size were used: the total number of Ada 
statements, the number of executable Ada statements and the number of source lines of code. 
Size measures have been shown in the literature to be related to the probability of generating defects 
[SP88, MK92]. 


• Assumption 6 (structural complexity): The structural complexity of the code should affect the 
probability of generating complex defects undetected during early walkthroughs and unit test. 


3.2 Evaluating the Accuracy of the Models 

We compare the results obtained using logistic regression and classification trees with those found using 
Optimized Set Reduction. The fully automated OSR process was used to generate the set of patterns 
partially presented in Section 3.3. For each modeling approach, a V-fold cross validation procedure was 
used [BF+84]. Each pattern vector was successively removed from the dataset. The model was built 
using the remainder of the dataset and then used to predict the pattern vector extracted. The prediction was 
compared to the actual value and this was repeated for each pattern vector in the dataset. Unless the available 
dataset is large, this validation method is preferable: it is an objective validation method (i.e., no 
arbitrary selection of a test sample) that allows model evaluations with a maximum number of 
observations. 

The variable selection process used for building the regression models was a stepwise selection process 
with a predetermined selection criterion of p = 0.05. Dummy variables [DG84] were created in order to 
deal with discrete explanatory variables. Principal components [DG84, HL89, MK92] have been 
extracted and used in an attempt to optimize the accuracy of the regression models. Two regression 
models were built. The first one is based exclusively on the original explanatory variables. The second 
one uses, as explanatory variables, the generated principal components which are linear functions of the 
original explanatory variables, where each is orthogonal with respect to the others. With respect to 
classification trees, the algorithm provided by the S-PLUS system [S92] was used and the parameters 
controlling the tree construction were tuned in order to get optimal accuracies. However, this process 
was quite tedious since no guideline or rationale exists for tuning these parameters despite the great 
instability of the generated trees. 

When comparing modeling techniques with respect to identifying high risk components, two different 
evaluation parameters must be considered simultaneously. Assume that when a high risk component is 
identified, a remedial action is taken during the testing phase (e.g., more expensive and more effective 
code reading technique) and that the benefit of this remedial action is validated and quantifiable. First, we have 
to consider the completeness of the model (i.e., the percentage of high risk components identified by the 
model). The benefit of this remedial action on the development process quality will be a function of 
completeness since the larger the number of high risk components identified, the higher the error 
detection rate. Also, the correctness of the model (i.e., the percentage of components identified as high 
risk that are actually of high risk) allows the user to quantify the waste of resources due to the 
unnecessary applications of remedial actions. 
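
The two evaluation parameters can be computed directly from classification counts, as in the short sketch below. The function names are illustrative; the counts are the ones reported for each model in Table 1 (components correctly identified as high risk, components flagged as high risk, and actual high risk components).

```python
def correctness(correctly_flagged, total_flagged):
    """Share of components flagged as high risk that are actually high risk."""
    return correctly_flagged / total_flagged


def completeness(correctly_flagged, total_high_risk):
    """Share of the actual high risk components that the model identified."""
    return correctly_flagged / total_high_risk


# (correctly flagged, total flagged, actual high risk), as reported in Table 1.
counts = {
    "Optimized Set Reduction": (70, 76, 73),
    "Classification trees": (60, 72, 73),
    "Logistic regression without principal components": (49, 64, 73),
    "Logistic regression with principal components": (52, 65, 73),
}

for model, (hits, flagged, high_risk) in counts.items():
    print(f"{model}: correctness {correctness(hits, flagged):.2%}, "
          f"completeness {completeness(hits, high_risk):.2%}")
```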

Table 1 shows these two parameters for logistic regression, classification trees and Optimized Set 
Reduction. OSR appears to be more accurate than both logistic regression and classification trees with 
respect to all the criteria considered. We conclude that the benefits of the remedial actions taken when 
identifying high risk components are increased using OSR. These results seem to indicate an 
improvement of the OSR algorithm when compared with the earlier version presented in [BBH92], 
where there were no significant accuracy differences when compared with logistic regression. 

The results shown in Table 1 have been obtained following the classification rules below: 

• Logistic regression: if the calculated probability of a component belonging to the high risk class 
was below 0.5, the low risk class was selected. Otherwise, the high risk class was selected. 

• Classification trees: The risk class was selected based upon the proportion of non-faulty and faulty 
components in the matching tree leaf. 

• OSR: For a given component, all the significantly reliable extracted patterns were considered for 
performing the classification. If those patterns all showed a high probability in the same risk class, 
then that class was selected. Otherwise, the risk class characterized by the pattern subset with the 
highest average pattern reliability was selected. If none of the extracted patterns happened to have a 
reliability significantly different from the random expected reliability, then the component was 
considered undetermined and thus classified randomly among the two risk classes. 
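
The OSR classification rule in the list above can be sketched as a small function. This is an illustration under assumptions, not the tool's implementation: each extracted pattern is represented as a hypothetical (predicted class, R, RS) tuple, and "significantly reliable" is approximated by RS falling below a user-set level `alpha`, with 0.05 used only as an example value.

```python
import random


def classify_component(patterns, alpha=0.05, rng=None):
    """Select a risk class from (predicted_class, R, RS) tuples for one component."""
    rng = rng or random.Random()
    # Keep only patterns whose reliability is significant at the chosen level.
    significant = [(cls, r) for cls, r, rs in patterns if rs < alpha]
    if not significant:
        # Undetermined component: classified randomly among the two classes.
        return rng.choice(["high_risk", "low_risk"])
    reliabilities_by_class = {}
    for cls, r in significant:
        reliabilities_by_class.setdefault(cls, []).append(r)
    # If all patterns agree there is a single candidate; otherwise the class
    # whose patterns have the highest average reliability is selected.
    return max(reliabilities_by_class,
               key=lambda cls: sum(reliabilities_by_class[cls])
               / len(reliabilities_by_class[cls]))


# Hypothetical patterns extracted for one component.
patterns = [("high_risk", 0.92, 0.01), ("low_risk", 0.80, 0.03),
            ("high_risk", 0.88, 0.20)]
print(classify_component(patterns))
```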

By selecting biased classification rules (e.g., 0.4 decision boundary for logistic regression), the model 
completeness and correctness could be modified. However, when completeness increases, correctness 
decreases and vice-versa. The best correctness / completeness tradeoff depends on the particular 
application of the model. The results below were obtained using unbiased classification rules. 


Model                                              Correctness       Completeness 
Optimized Set Reduction                            92.11% (70/76)    95.89% (70/73) 
Classification trees                               83.33% (60/72)    82.19% (60/73) 
Logistic regression without principal components   76.56% (49/64)    67.12% (49/73) 
Logistic regression with principal components      80.00% (52/65)    71.23% (52/73) 

Table 1: Model Accuracies 


3.3 OSR Patterns' Interpretations 

Comparison between the interpretability of logistic regression equations and OSR patterns may be found 
in [BTH93]. Issues associated with classification tree interpretation are discussed in [BBT92]. In this 
section, we illustrate and evaluate the interpretability of OSR patterns. Some of the patterns 
characterizing "data value / structure" errors will be described in order to illustrate the interpretation 
process in the OSR context. Patterns will be presented in a format facilitating their readability. Class 
boundaries will not be shown since they are not meaningful to the reader. Instead their corresponding 
quantiles on the explanatory variable range (in the appropriate context) will be used to describe 
predicates. 

3.3.1 Regression Equation 

The regression equation generated is as follows: 


$$\log\left(\frac{p}{1-p}\right) = 0.337 + 0.0103\,\mathrm{SLOC} - 0.00107\,\mathrm{LUADA} - 1.8274\,\mathrm{LUFREUC}$$

where p = Prob(component is high risk). 
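
To show how this equation turns metric values into a classification, the sketch below inverts the logit and applies the 0.5 decision boundary stated in Section 3.2. The coefficients are those of the equation above; the function names and the input values are hypothetical, and the units of the inputs (for example, whether LUFREUC is a ratio or a percentage) are an assumption.

```python
from math import exp


def high_risk_probability(sloc, luada, lufreuc):
    """Probability of the high risk class from the fitted logit equation."""
    logit = 0.337 + 0.0103 * sloc - 0.00107 * luada - 1.8274 * lufreuc
    return 1.0 / (1.0 + exp(-logit))


def classify(sloc, luada, lufreuc, boundary=0.5):
    """Unbiased rule: high risk unless the probability is below the boundary."""
    p = high_risk_probability(sloc, luada, lufreuc)
    return "high_risk" if p >= boundary else "low_risk"


# Hypothetical component: 100 source lines, LUADA = 200, LUFREUC = 0.5.
p = high_risk_probability(100, 200, 0.5)
print(round(p, 3), classify(100, 200, 0.5))
```

Lowering `boundary` (for instance to 0.4) flags more components as high risk, which is the completeness / correctness tradeoff discussed in Section 3.2.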

One of the main problems of logistic regression models with respect to their interpretation is the inherent 
instability of regression coefficients when the underlying assumptions of the model are not met (see 
[BTH93] for example and details). In some cases, looking at the correlation matrix may help avoid the 
problem when interpreting. Another related problem is that many good predictors were not selected by 
the stepwise selection process because of a strong correlation with already included parameters. In order 
to interpret the regression equations, the user has to look carefully at the correlation matrix and the 
regression equation in order to have some meaningful insight into the associations between explanatory 
variables and the dependent variable. Instability may be due to other causes like overinfluential data 
points (outliers) or interactions between explanatory variables [DG84, HL89]. 

We will demonstrate in the next paragraphs that, on our dataset, logistic regression does not extract much 
of the information provided by the data set. Some of the assumptions made in Section 3.1 will be 
supported by the OSR patterns. 

3.3.2 Patterns for Data Value / Structure Errors 


The patterns listed below are the ones that seemed to confirm the assumptions stated in Section 3.1. Our 
goal was not to make assumptions based on the generated patterns since this is a risky and dangerous 
approach to data analysis, i.e., exploratory data analysis. As a matter of fact, many of the generated 
patterns were not clearly understandable to us and did not fit in our list of assumptions. Generating 
interpretable patterns does not imply generating easy to understand patterns, which is due to the indirect 
and complex nature of some of the statistically significant associations extracted from our data sets. 
Moreover, since statistical models do not deal with causality, interpretation becomes an even more 
sensitive process. 

Patterns are grouped according to the assumption they support. For each pattern presented, the entropy 
associated with each predicate (here singleton predicates) is shown just below the predicate itself. 
Patterns were generated entirely automatically without human intervention. As opposed to the 
classification tree approach [S92], no "tuning" of the algorithm was necessary since the parameters of 
the OSR algorithm are all intuitively meaningful (e.g., user set statistical levels of significance for 
differentiating distributions) and can be set at once. The predicates' value intervals have been calculated 
automatically according to the procedure described in Section 2.2.1. This approach for handling 
predicate intervals automatically and dynamically (classes change in various contexts) gives more 
meaning to the interpretation of the OSR patterns. The first group of patterns is discussed in detail in 
order to remind the reader about how to read these patterns. A definition of the metrics appearing in the 
patterns presented below is provided in Appendix I. 


• Pattern Group 1: Complex code within a largely reused LUA (Assumptions 4 and 6) 

NDMAX ∈ [52% - 100%] ∧ LUFREUS ∈ [0% - 71%] => High Risk 
H = 0.89                H = 0.73 

NDMAX ∈ [52% - 100%] ∧ LUFREUC ∈ [0% - 81%] => High Risk 
H = 0.89                H = 0.75 


Picking those components with a relatively small amount of reuse within the subset whose maximum 
statement nesting level is high implies a high probability that the component will be in the high risk class 
(i.e., to generate errors). 


The individual impact of predicates (here all singletons) on the risk (i.e., probability to be in the high 
risk class) can be quantified by looking at the entropy variation they generate. NDMAX ∈ [52% - 
100%] creates a variation of entropy of 0.11 (from 1.0, the initial set entropy, to 0.89). In this context, 
a variation of entropy of 0.16 can be observed for LUFREUS ∈ [0% - 71%] (from 0.89 to 0.73). 
However, there is no strong evidence that the amount of reuse in a LUA is a high risk characteristic 
when NDMAX ∉ [52% - 100%]. In other words, this pattern group seems to indicate that 
architectural reuse pays off in terms of defect probability only in the context of complex components. 
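
The entropy values quoted with each predicate can be related to class counts through a short sketch. It assumes the entropy H is the Shannon entropy in bits of the dependent variable distribution in a subset, which is consistent with an initial entropy of 1.0 for two equally represented classes; the function name and the subset counts are hypothetical.

```python
from math import log2


def entropy(class_counts):
    """Shannon entropy (in bits) of a distribution given as counts per class."""
    total = sum(class_counts)
    return -sum((n / total) * log2(n / total) for n in class_counts if n > 0)


# Balanced initial set: 73 high risk and 73 low risk components.
initial = entropy([73, 73])
# Hypothetical subset extracted by a predicate: 45 high risk, 20 low risk.
subset = entropy([45, 20])
print(round(initial, 2), round(subset, 2), round(initial - subset, 2))
```

The difference between the entropy before and after applying a predicate is the entropy variation used to gauge that predicate's impact in its context.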

• Pattern Group 2: Large compilation units within a LUA with a high level of parametrization 
(Assumptions 2 and 6 ). 


( SLOC e [57% - 100%] V V € [54% - 100%] ) * LUPARPD € [53% - 100%] => High Risk 
H = 0.84 H = 0.46 

LUPARPD is an indicator of the average program unit interface complexity within a particular LUA. 
This complexity seems even more difficult to handle for large components (i.e., large number of lines of 
code, operands and operators). Based on the process defined in Section 2.4, the reliability of this pattern 
has been assessed at 100% and appears to be significant at RS = 0.06. Since this data set is small, 
[...] significances below 0.1. Here again, there is no strong evidence that 
LUPARPD is a high risk characteristic in the context of small components. Large components with 
complex interfaces are risky while small components do not seem to be strongly affected. 


• Pattern Group 3: Large and complex compilation units within a LUA containing high quantities of 
cascaded imports (Assumptions 3, 5 and 6). 

( SLOC ∈ [57% - 100%] ∨ V ∈ [54% - 100%] ) ∧ LUACTMAX ∈ [64% - 100%] => High Risk 
H = 0.84                                      H = 0.0 

NDAV ∈ [65% - 100%] ∧ LUCMIMP ∈ [36% - 100%] => High Risk 
H = 0.92              H = 0.0 

Importing large quantities of cascaded declarations seems to significantly increase the risk of defects 
even in the context of large and/or complex components, i.e., large number of lines of code, operands 
and operators. Once again, small components do not seem to be affected. 

In this pattern, the first predicate is an example of a composite predicate and is the result of the merging 
process. PHI (i.e., the merging criterion) was fixed to 0.7. 

• Pattern Group 4: Complex compilation units in the context of a LUA that exports/imports large 
quantities of declarations towards other LUAs (Assumptions 1, 5 and 6). 

LUWBYCU ∈ [79% - 100%] ∧ DOBJ ∈ [46% - 100%] => High Risk 
H = 0.78                 H = 0.34 

LUWBYCU ∈ [79% - 100%] ∧ VG ∈ [26% - 100%] => High Risk 
H = 0.78                 H = 0.44 

LUCC ∈ [93% - 100%] => High Risk 
H = 0.0 

This pattern group seems to indicate that interfacing with other compilation units in order to export 
complex compilation units (i.e., with a large number of declared / defined variables or a large cyclomatic 
complexity) carries a high defect risk. These patterns illustrate how the notion of context can play an 
important role when determining the impact of an explanatory variable. This shows that when one wants 
to validate assumptions, the answer may not be as simple as yes or no. In our particular example, most 
of the assumptions would not have been validated by simply looking at the regression model [CAP88]. 

• Pattern Group 5: When average statement nesting level is high, the "size" of the component is large 
and this component has an ALgorithmic / COMPutational functionality (according to the NASA SEL 
taxonomy), then there is a high probability that the component is high risk. Note that this is an example 
of the use of non-singleton predicates. 

NDAV ∈ [65% - 100%] ∧ ( ALCOMP = YES ∧ ( SLOC ∈ [15% - 100%] ∨ V ∈ [19% - 100%] ∨ TOTASTMT ∈ [23% - 100%] ) ) => High Risk 
H = 0.92               H = 0.75 
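
To make the reading of such patterns concrete, the sketch below evaluates a pattern against a component. It is only an illustration in Python: it assumes that metric values have already been mapped onto the percentage scale used by the intervals, and the class `Interval`, the helper functions and the two sample components are hypothetical. The pattern used is the first pattern of Pattern Group 3.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    """Singleton predicate: the metric value lies in [low, high] (percent scale)."""
    metric: str
    low: float
    high: float

    def holds(self, component):
        return self.low <= component[self.metric] <= self.high

def composite_holds(composite, component):
    """A composite predicate is a disjunction of singleton predicates."""
    return any(p.holds(component) for p in composite)

def pattern_matches(pattern, component):
    """A pattern is a conjunction of composite predicates."""
    return all(composite_holds(cp, component) for cp in pattern)

# (SLOC in [57, 100] or V in [54, 100]) and LUACTMAX in [64, 100] => High Risk
pattern = [
    [Interval("SLOC", 57, 100), Interval("V", 54, 100)],
    [Interval("LUACTMAX", 64, 100)],
]

# Hypothetical components, with metric values already on the percent scale.
large_cascaded = {"SLOC": 80, "V": 40, "LUACTMAX": 70}
small_cascaded = {"SLOC": 20, "V": 30, "LUACTMAX": 90}

matches_large = pattern_matches(pattern, large_cascaded)
matches_small = pattern_matches(pattern, small_cascaded)
```

The large component satisfies both composite predicates and falls under the pattern, while the small component fails the first composite predicate despite its high LUACTMAX value.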


4 Conclusions 

Five main conclusions can be drawn from this paper: 

(1) Based on a rather small and incomplete data set, i.e., 146 Ada components, completeness and 
correctness above 90% have been obtained by using the OSR modeling process. If this level of 
accuracy is not sufficient, the user can tune the decision boundary to increase either the 
correctness or the completeness according to his or her specific needs. 

(2) OSR Patterns appear to be more stable and interpretable structures than regression equations 
when the theoretical underlying assumptions are not met. Taking effective corrective actions is only 
possible when the impact of controllable factors on the parameters to be controlled (e.g., cost, 
quality) can be fully understood and quantified. 

(3) OSR Patterns seem to generate a more complete set of information, i.e., validate more 
assumptions, than the logistic regression equation. This may be partially corrected by looking at the 
explanatory variable correlation matrix. However, this is an extremely tedious and not always 
helpful task, e.g., issues like interactions between explanatory variables are still not addressed. 

(4) OSR classifications were found to be more accurate than logistic regression equations. This also 
confirms previous studies showing similar results for other kinds of applications [BBT92, BTH93]. 
Therefore, the Optimized Set Reduction approach seems to be a good alternative and/or complement 
to multivariate logistic regression in this application domain. 

(5) OSR classifications were found to be more accurate than a classification tree. This also confirms 
earlier results we obtained on the datasets used in [BTH93], where classification trees 
performed worse than both logistic regression and OSR. These results seem to suggest that the 
classification tree structure, even though simple to generate and use, might be too simplistic for 
modeling complex artifacts such as high risk components. 

From a more general perspective, the OSR approach is a data analysis framework that successfully 
integrates statistical and machine learning approaches in empirical modeling with respect to specific 
software engineering needs: it provides support for dealing with partial information and for model 
interpretation, and it is not based on a severely constraining set of hypotheses. 

Appendix I: Definitions of the metrics appearing in the paper 

Library Unit Aggregation (LUA) metrics: 

. LUACTMAX: total number of cascaded program unit declarations / maximum possible number of 
cascaded program unit declarations 

. LUCMIMP: cascaded imported program unit declarations / direct imported program unit 
declarations 

. LUWBYLU: number of library unit aggregations that contain a with statement to this compilation 
unit 

. LUWBYCU: number of compilation units that contain a with statement to this compilation unit 
. LUPARPD: number of parameters per program unit declaration in the LUA 
. LUFREUC: fraction of old (reused verbatim) number of components in the LUA 
. LUFREUS: fraction of old (reused verbatim) number of SLOC's in the LUA 
. LUADA: number of Ada statements in the LUA 
. LUCC: unique imported declarations / unique exported declarations 

Compilation unit metrics: 

. NDMAX: maximum statement nesting level 
. NDAV: average statement nesting level 
. SLOC: source lines of code 
. V: Halstead's volume 
. VG: cyclomatic complexity 
. DOBJ: number of declared variables 

Appendix II: Algorithms 


The Merging Algorithm 

This merging process can be formalized using the following definitions and algorithms: 

Recall the definitions of predicate and composite predicate from Sections 2.1.1 and 2.4. Let cp represent a 
composite predicate. Then, we define: 


• Definition A1: A context (C) is an ordered conjunction of composite predicates that defines a 
subset of pattern vectors PSS (i.e., PSS = SUBSET(PVS, C)). 


• Definition A2: An association coefficient a^C_ij is an assigned statistical degree of association 
between cpi and cpj in PSS = SUBSET(PVS, C). Let PSSi = SUBSET(PSS, cpi) and let 
PSSj = SUBSET(PSS, cpj). 

A two row-two column contingency table is defined as shown in Figure 5. 

                 PSSj                      PSS - PSSj 
PSSi             | PSSi ∩ PSSj |           | PSSi ∩ (PSS - PSSj) | 
PSS - PSSi       | (PSS - PSSi) ∩ PSSj |   | (PSS - PSSi) ∩ (PSS - PSSj) | 

Figure 5: Predicate Association 

Based on this table, a Chi-Square based statistic (Pearson's Phi), the degree of association between cpi 
and cpj in PSS, is calculated and assigned to a^C_ij. Note that this association coefficient is calculated in the 
context of C (i.e., PSS = SUBSET(PVS, C)) and therefore is only valid under C. 
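
As an illustration of how such a coefficient can be obtained from the four cells of the contingency table, here is a minimal Python sketch. It assumes that pattern vectors are identified by hashable ids and uses the standard signed form of Pearson's Phi, whose square equals the Chi-Square statistic divided by |PSS|; the function name and the sample sets are hypothetical.

```python
import math

def phi_coefficient(pss, pss_i, pss_j):
    """Pearson's Phi between membership in pss_i and pss_j, computed within pss."""
    pss = set(pss)
    pss_i, pss_j = set(pss_i) & pss, set(pss_j) & pss
    n11 = len(pss_i & pss_j)         # in both subsets
    n10 = len(pss_i - pss_j)         # in PSSi only
    n01 = len(pss_j - pss_i)         # in PSSj only
    n00 = len(pss - pss_i - pss_j)   # in neither subset
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        return 0.0  # a degenerate table carries no association information
    return (n11 * n00 - n10 * n01) / denom

pss = set(range(10))
a_overlap = phi_coefficient(pss, {0, 1, 2, 3, 4}, {0, 1, 2, 3, 5})
a_identical = phi_coefficient(pss, {0, 1, 2, 3, 4}, {0, 1, 2, 3, 4})
```

Two composite predicates that select exactly the same pattern vectors obtain a coefficient of 1, and the value decreases as the selected subsets diverge.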

• Definition A3: An association matrix A^C_n is a square matrix of association coefficients calculated 
under a context C, where the rows / columns are marked by composite predicates. 

Example: A^C_n contains all a^C_ij, i,j ∈ {1,...,n} 


• Definition A4: Two composite predicates cpi and cpj are said to be similar in the context of C if 
a^C_ij ≥ PHI (the minimal level of association defined by the user). This association will be denoted as 
cpi ~ cpj. 


• Definition A5: A predicate tree is a tree representation of the patterns generated by the process that 
extracts the specific pattern set (SPS). As mentioned in Section 2.4, the SPS is a set of patterns 
representing the observed trends in the historical data set. It is expected that a significant number of 
these patterns will be duplicated or similar. This representation is a compact way of representing the 
SPS. Each path of a predicate tree represents a pattern (see Figure 6). 



Note that in the above example, all of the predicates are singleton. This could represent a predicate tree 
which summarizes an OSR run. During the merging process, branches will be merged and composite 
predicates created at the nodes. 
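
A predicate tree of this kind can be sketched as a prefix tree over patterns. The Python fragment below is only an illustration: it assumes that each pattern is an ordered list of predicate labels, that the tree is stored as nested dictionaries, and that no pattern is a strict prefix of another; the function names and the sample patterns are hypothetical.

```python
def build_predicate_tree(patterns):
    """Merge patterns (ordered lists of predicate labels) into a prefix tree."""
    root = {}
    for pattern in patterns:
        node = root
        for predicate in pattern:
            # Shared prefixes reuse the same node, so duplicates collapse.
            node = node.setdefault(predicate, {})
    return root

def paths(tree, prefix=()):
    """Enumerate root-to-leaf paths; each path is one pattern."""
    if not tree:
        yield prefix
        return
    for predicate, subtree in tree.items():
        yield from paths(subtree, prefix + (predicate,))

sps = [
    ["NDMAX in [52, 100]", "LUFREUS in [0, 71]"],
    ["NDMAX in [52, 100]", "LUFREUC in [0, 81]"],
    ["NDMAX in [52, 100]", "LUFREUS in [0, 71]"],  # duplicated pattern
]
tree = build_predicate_tree(sps)
distinct_patterns = sorted(paths(tree))
```

The three patterns share a single root branch, and the duplicated pattern does not create a new path.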

• Definition A6: Two composite predicates cpi, cpj are said to be "mergeable neighboring composite 
predicates" if the following conditions are fulfilled: 

(1) There exist two predicates Predm and Predn, where Predm = (Xi ∈ Class_ik) and Predn = (Xi ∈ 
Class_il) (both are singleton predicates), such that Predm and Predn are disjuncts in cpi and cpj, 
respectively. 

(2) Class_ik and Class_il are neighboring (or overlapping) classes on the domain of variable Xi. 

(3) cpi and cpj yield similar distributions on the dependent variable range (i.e., the level of 
significance of the two distributions being different is above S (user defined)). 


If these three conditions are true, then MNCP(cpi, cpj, S) is TRUE. 


We can now define the merging algorithm as follows: 

procedure MERGE (predicate tree, node, context, PHI, S) 

(1) if (node is a terminal node of the predicate tree) 
        then RETURN 

(2) while (∃ cpi, cpj such that MNCP (cpi, cpj, S)) do 
        UNION (predicate tree, node, cpi, cpj) 

(3) Calculate the association matrix A^context 

(4) while (∃ cpi, cpj such that cpi ~ cpj) do 
        . select cpi and cpj such that a^context_ij is the strongest association in A^context 
        . UNION (predicate tree, node, cpi, cpj) 
        . recalculate the association matrix for 
          cp1, ..., cpi-1, cpi+1, ..., cpj-1, cpj+1, ..., cpm, cpi∪j under context 

(5) for each successor of node in predicate tree 
        MERGE (predicate tree, successor, context ∧ cpnode, PHI, S) 

end MERGE 

In step (4), a call is made to procedure UNION which is defined as follows: 

procedure UNION (predicate tree, NODE, cpi, cpj) 

(1) Form a new node marked by the composite predicate cpi ∪ cpj (i.e., cpi∪j) 

(2) Delete nodes marked by cpi and cpj under NODE 

(3) Combine all like subpaths rooted at cpi∪j 

end UNION 

The merging process is initiated with the procedure call: 

MERGE (predicate tree, root, ∅, PHI, S) 
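
Step (4) of MERGE, taken at a single node, can be sketched as a greedy loop. The Python fragment below is a simplified illustration rather than the full procedure: it omits the MNCP step (2) and the recursion over successors, represents each composite predicate as a frozenset of singleton labels mapped to the set of pattern-vector ids it selects, and uses the signed form of Pearson's Phi. All names and sample data are hypothetical.

```python
import math
from itertools import combinations

def phi(n, a, b):
    """Pearson's Phi between subsets a and b of a set of n pattern vectors."""
    n11 = len(a & b)
    n10, n01 = len(a) - n11, len(b) - n11
    n00 = n - n11 - n10 - n01
    denom = math.sqrt(len(a) * (n - len(a)) * len(b) * (n - len(b)))
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0

def merge_similar(pss, selected, PHI):
    """Greedy merging of similar composite predicates under one context.

    pss      -- set of pattern-vector ids defined by the current context
    selected -- dict: composite predicate -> subset of pss that it selects
    PHI      -- minimal association for two composite predicates to be similar
    """
    selected = dict(selected)
    while len(selected) > 1:
        # Strongest entry of the association matrix, recomputed each round.
        (cp_i, cp_j), best = max(
            (((p, q), phi(len(pss), selected[p], selected[q]))
             for p, q in combinations(selected, 2)),
            key=lambda item: item[1],
        )
        if best < PHI:
            break
        # UNION: replace cp_i and cp_j by their disjunction.
        merged = selected.pop(cp_i) | selected.pop(cp_j)
        selected[cp_i | cp_j] = merged
    return selected

pss = set(range(10))
sloc = frozenset({"SLOC in [57, 100]"})
vol = frozenset({"V in [54, 100]"})
ndav = frozenset({"NDAV in [65, 100]"})
result = merge_similar(
    pss,
    {sloc: {0, 1, 2, 3, 4}, vol: {0, 1, 2, 3, 4, 5}, ndav: {6, 7, 8, 9}},
    PHI=0.7,
)
```

With these sample sets, the SLOC and V predicates select almost the same pattern vectors and are merged into one composite predicate, while the NDAV predicate stays separate.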

Discretization Algorithm 

Procedure parameter definitions: 

. EV: the explanatory variable whose range is going to be discretized 
. DV: the dependent variable of the model to be built 
. dataset: set of pattern vectors to be discretized along the scale of variable 
. criterion: maximum level of significance accepted to recognize two distributions as different 
. classes: the definition of the intervals (classes) on the variable's range, i.e., a set of pairs of boundaries 

procedure DISCRETIZATION (dataset, EV, DV, criterion, classes) 

(1) sort dataset elements in increasing order according to the elements' EV values 

(2) OPTIMAL_SPLIT (dataset, EV, DV, criterion, optimal_bound) 

(3) if (dataset has actually been split in (2)) 
    then { 
        (3.1) update the definition of classes with the newly calculated 
              optimal_bound 

        (3.2) extract two subsets sset1, sset2 of dataset where 
              EV < optimal_bound and EV >= optimal_bound, respectively 

        (3.3) DISCRETIZATION (sset1, EV, DV, criterion, classes) 

        (3.4) DISCRETIZATION (sset2, EV, DV, criterion, classes) 
    } 

end DISCRETIZATION 


The procedure for splitting datasets may be defined as follows: 

procedure OPTIMAL_SPLIT (dataset, EV, DV, criterion, optimal_bound) 


min_entropy = H (dataset, DV)    /* minimal entropy calculated so far */ 

for all data vectors Vi in dataset (in sorted order) 
{ 
    Case 1: there is a change in DV value but not in EV value 
    { homogeneous = FALSE } 

    Case 2: there is a change in EV value (from EVV1 to EVV2) and while EV 
    values remained constant and equal to EVV1, homogeneous remained equal to 
    TRUE 
    { 
        /* 
        STEP 1: calculate the entropy of the distribution on the DV range for the 
                dataset subset lying in the interval strictly below EVV2 (SSET2) 
        STEP 2: calculate the level of significance of the difference in 
                distribution between dataset and SSET2 
        STEP 3: if the level of significance is below criterion and the 
                entropy is below the minimal entropy calculated so far, then 
                optimal_bound is assigned with EVV2 
        */ 
        Entropy2 = H (SSET2, DV) 
        s2 = DIFFDIST (dataset, SSET2, DV) 
        if (Entropy2 < H (dataset, DV) & Entropy2 < min_entropy & s2 < criterion) 
            then { optimal_bound = EVV2; min_entropy = Entropy2 } 
    } 

    Case 3: there is a change in EV value (from EVV1 to EVV2) and while EV 
    values remained constant and equal to EVV1, DV values changed at least once 
    and homogeneous = FALSE 
    { 
        /* 
        SSET1 is the dataset subset lying in the interval strictly below EVV1. 
        STEP 1: calculate the entropy of the distribution on the DV range for the 
                dataset subset lying in the interval strictly below EVV1 (SSET1) 
        STEP 2: calculate the level of significance of the difference in 
                distribution between dataset and SSET1 
        STEP 3: if the level of significance is below criterion and the 
                entropy is below the minimal entropy calculated so far, then 
                optimal_bound is assigned with EVV1 
        STEP 4: repeat the same procedure as above for SSET2 
        STEP 5: set homogeneous to TRUE 
        */ 
        Entropy1 = H (SSET1, DV) 
        s1 = DIFFDIST (dataset, SSET1, DV) 
        if (Entropy1 < H (dataset, DV) & Entropy1 < min_entropy & s1 < criterion) 
            then { optimal_bound = EVV1; min_entropy = Entropy1 } 
        Entropy2 = H (SSET2, DV) 
        s2 = DIFFDIST (dataset, SSET2, DV) 
        if (Entropy2 < H (dataset, DV) & Entropy2 < min_entropy & s2 < criterion) 
            then { optimal_bound = EVV2; min_entropy = Entropy2 } 
        homogeneous = TRUE 
    } 

    Case 4: no change in DV value 
    /* Do nothing */ 

} /* end of for loop */ 
end OPTIMAL_SPLIT 
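
The two procedures above can be approximated by the following Python sketch. It is a simplified illustration under several assumptions: the dataset is a list of (EV value, DV value) pairs with a two-class DV; the case analysis is reduced to evaluating the boundaries between consecutive EV values where the DV value changes or is not homogeneous; and DIFFDIST, which is not defined in this appendix, is replaced by a stand-in (the p-value of a Chi-Square goodness-of-fit test with one degree of freedom), which is only a rough approximation for small subsets. All function names are hypothetical.

```python
import math
from collections import Counter
from itertools import groupby

def H(rows):
    """Entropy (base 2) of the DV distribution; rows are (ev, dv) pairs."""
    counts = Counter(dv for _, dv in rows)
    n = len(rows)
    return sum(c / n * math.log2(n / c) for c in counts.values())

def diffdist(dataset, subset):
    """Stand-in for DIFFDIST (an assumption): p-value of a Chi-Square
    goodness-of-fit test of the subset's class counts against the
    dataset's class proportions, for a two-class DV."""
    total = Counter(dv for _, dv in dataset)
    sub = Counter(dv for _, dv in subset)
    chi2 = 0.0
    for cls, count in total.items():
        expected = len(subset) * count / len(dataset)
        chi2 += (sub[cls] - expected) ** 2 / expected
    return math.erfc(math.sqrt(chi2 / 2))

def optimal_split(dataset, criterion):
    """Return the boundary whose lower subset has the lowest entropy among
    the significant candidates, or None when no boundary qualifies."""
    rows = sorted(dataset)
    groups = [(ev, [dv for _, dv in grp])
              for ev, grp in groupby(rows, key=lambda r: r[0])]
    best_bound, min_entropy = None, H(rows)
    below = []
    for (ev1, dvs1), (ev2, dvs2) in zip(groups, groups[1:]):
        below += [(ev1, dv) for dv in dvs1]
        if len(set(dvs1)) == 1 and set(dvs1) == set(dvs2):
            continue  # no change in DV value: nothing to evaluate
        entropy = H(below)
        if entropy < min_entropy and diffdist(rows, below) < criterion:
            best_bound, min_entropy = ev2, entropy
    return best_bound

def discretize(dataset, criterion, bounds=None):
    """Recursively split the EV range; returns the sorted list of boundaries."""
    bounds = [] if bounds is None else bounds
    bound = optimal_split(dataset, criterion)
    if bound is not None:
        bounds.append(bound)
        discretize([r for r in dataset if r[0] < bound], criterion, bounds)
        discretize([r for r in dataset if r[0] >= bound], criterion, bounds)
    return sorted(bounds)

# Hypothetical data: low EV values are class 0, high EV values are class 1.
data = [(ev, 0) for ev in range(1, 6)] + [(ev, 1) for ev in range(6, 11)]
boundaries = discretize(data, criterion=0.05)
```

On this sample, the only boundary evaluated is the one where the class changes, so the EV range is split once, at the value 6.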